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Abstract 

We study the compressed sensing (CS) signal estimation problem where an input is measured via a linear matrix 
multiplication under additive noise. While this setup usually assumes sparsity or compressibility in the observed signal 
during recovery, the signal structure that can be leveraged is often not known a priori. In this paper, we consider 
universal CS recovery, where the statistics of a stationary ergodic signal source are estimated simultaneously with 
the signal itself. We focus on a maximum a posteriori (MAP) estimation framework that leverages universal priors 
such as Kolmogorov complexity and minimum description length. We provide theoretical results that support the 
algorithmic feasibility of universal MAP estimation through a Markov Chain Monte Carlo implementation. We also 
include simulation results that showcase the promise of universality in CS, particularly for low-complexity sources 
that do not exhibit standard sparsity or compressibility. 

Index Terms 

Compressed sensing, Kolmogorov complexity, MAP estimation, Markov chain Monte Carlo, universal coding. 

I. Introduction 

Since many systems in science and engineering are approximately linear, linear inverse problems have attracted 
great attention in the signal processing community. A signal x e is recorded via a linear operator under additive 
noise: 

y^(^X + Z, (1) 

where $ is an M x N matrix and z e denotes the noise. The goal is to estimate x from the measurements 
y given knowledge of $ and a model for the noise z. When M <^ N, the setup is known as compressed sensing 
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(CS) and the estimation problem is commonly referred to as recovery or reconstruction; by posing a sparsity or 
compressibility requirement on the signal and using this requirement as a prior during recovery, it is indeed possible 
to accurately estimate x from y [2,3]. 

While in CS the acquisition can be designed independently of the particular signal prior through the use of 
randomized measurement matrices $, the majority of existing recovery algorithms require knowledge of the sparsity 
structure of x, i.e., the choice of transformation that renders a sparse coefficient vector for the signal. A second, 
separate class of Bayesian CS recovery algorithms poses a probabilistic prior for the coefficients of a; in a known 
transform domain [4-6]. In contrast, complexity -based regularization methods can use arbitrary prior information 
on the signal model and come with analytical guarantees, but are only computationally efficient for specific signal 
models, such as the independent-entry Laplacian model [7]. As a fourth alternative, there exist algorithms that can 
formulate dictionaries that yield sparse representations for the signals of interest when a large amount of training 
data is available [8-10]. 

In certain cases, one might not be certain about the structure or statistics of the source prior to recovery. It would 
nonetheless be desirable to formulate algorithms to estimate x that are agnostic to the particular statistics of the 
signal. Therefore, we shift our focus from the standard sparsity or compressibility priors to universal priors [11, 
12]. Such concepts have been previously leveraged in the Kolmogorov sampler universal denoising algorithm [13], 
which minimizes Kolmogorov complexity [14-17].' Related approaches based on minimum description length 
(MDL) [20-22] minimize the complexity of the estimated signal with respect to some class of parametric sources. 

Unfortunately, while MDL may provide a suitable algorithmic recovery framework for parametric sources, alter- 
native approaches for non-parametric sources based on Kolmogorov complexity are not computable in practice. To 
address this computational problem, we confine our attention to stationary ergodic sources and develop an algorithmic 
framework for universal signal estimation in CS systems. Our framework leverages the fact that for stationary ergodic 
sources, both the per-symbol empirical entropy and Kolmogorov complexity converge asymptotically almost surely 
to the entropy rate of the source [23]. We aim to minimize the empirical entropy; our minimization is regularized 
by introducing a log likelihood for the noise model, which is equivalent to the standard least squares under additive 
white Gaussian noise. Other noise distributions are readily supported. 

We make several contributions toward our universal CS framework. First, we show that for well-behaved sources 
the maximum a posteriori (MAP) risk over a specific quantization grid converges asymptotically in the signal 
length to the MAP risk of the non-quantized estimator. Second, we apply this quantization equivalence result to 
a MAP estimator driven by a universal prior, providing a finite-computation universal estimation scheme. Third, 
we propose a recovery algorithm based on Markov chain Monte Carlo (MCMC) to approximate this estimation 
procedure. Fourth, we prove that for a sufficiently large number of iterations the output of our MCMC recovery 

'a recent paper by Jalali and Maleki [18], developed independently from and appearing simultaneously with our work [1, 19], considered the 
performance of Kolmogorov complexity minimization for CS recovery from measurements corrupted by noise of bounded magnitude. 
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Fig. 1. Universal MCMC, EMBG [24], and £i -nonn nainimization recovery results for the four-state Markov switching source of 
length N — 10000 as a function of the number of Gaussian random measurements M for different SNR values. For M > 4000, 
MCMC significantly outperforms £i-norm minimization and EMBG, which fail due to the signal not being sparse in a fixed basis. 
Each point in the graph represents median performance over 25 signal and random measurement matrix draws. 

algorithm converges to the correct MAP estimate. Fifth, we identify computational bottlenecks in the implementation 
of our MCMC estimator and show approaches to reduce their complexity. Sixth, we develop an adaptive quantization 
scheme that tailors a set of representation levels to minimize the quantization error within the MCMC iterations 
and that provides an accelerated implementation. Finally, we showcase encouraging experimental results that show 
recovery performance for a variety of types of signal structures (or statistics) that meets or exceeds that of popular 
state-of-the-art algorithms. 

To showcase the potential of our universal estimation approach. Fig. 1 illustrates recovery results from Gaus- 
sian measurement matrices for a four-state Markov source of length N = 10000 that generates the pattern 
+ 1,+1,— 1,— 1,+1,+1,— 1,— 1... with 3% errors in state transitions, resulting in the signal switching from -1 to 
H-1 or vice- versa either too early or too late. While it is well known that sparsity -promoting recovery algorithms 
such as ^i-norm minimization can recover sparse sources from linear measurements, the aforementioned switching 
source is not sparse in a foreknown basis, rendering such algorithms not applicable. In contrast, our MCMC 
recovery algorithm estimates this source with high fidelity when the signal to noise ratio (SNR) is sufficiently large 
and a moderate number of measurements M is available. Our experimental results in Section VII also show some 
challenges faced by MCMC recovery of certain classes of sparse signals; we identify properties of the algorithm that 
may cause these challenges, which remain to be addressed. While the practical value of our MCMC algorithm may 
be reduced due to its high computational cost, our approach provides a starting point toward further performance 
gains of more practical algorithms for computing the universal MAP estimator Furthermore, our experiments will 
show that the performance of MCMC is comparable to and sometimes significantly better than existing state-of- 
the-art algorithms. 

This paper is organized as follows. Section II provides background content. Section III overviews MAP estimation 
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and quantization, and Section IV introduces universal MAP estimation. Section V formulates a concrete MCMC 
algorithm for universal MAP estimation, Section VI describes our proposed adaptive quantization scheme for 
MCMC, and Section VII presents initial experimental results. We conclude in Section VIII. The proofs of our main 
theoretical results appear in the appendix. 

II. Background and related work 

A. Compressed Sensing 

Consider the noisy measurement setup via a linear operator (1). The input vector x E is generated by a 
stationary and ergodic source X, and must be estimated from y and $. The distribution fx that generates X is 
unknown. The matrix $ G M^^^^ has independent and identically distributed (i.i.d.) Gaussian entries, $(m, n) ^ 
A/^(0, jj )^ These moments ensure that columns of the matrix have unit norm on average. For concrete analysis, 
we assume the noise z E M.^^ to be i.i.d. Gaussian, with mean zero and known variance cr^ for simplicity. Other 
noise distributions are readily supported. 

We focus on the setting where Af, N oo and the aspect ratio is positive, 

6^ lim — > 0. (2) 

N^oo N 

Similar settings have been discussed in the literature, e.g., [25,26]. Since x was generated by an unknown source, 
we must search for an estimation mechanism that is agnostic to the specific distribution fx- 

B. Quantization 

Following the approach of Baron and Weissman [27], we define the set of data-independent reproduction levels 
for quantizing x as 

7^= 0,i,...l, (3) 



7 7 

where 7 = [In(A^)] . As N increases, TZ will quantize a; to a greater resolution. In Section III, we will show that 
under suitable conditions on fx, performing maximum a posteriori (MAP) estimation over the discrete alphabet 
TZ asymptotically converges to the MAP estimate over the continuous distribution fx- This reduces the complexity 
of the estimation problem from continuous to combinatorial. 

C- Related work 

For a scalar channel, e.g., $ = / and y = x + z, Donoho proposed the the Kobnogorov sampler (KS) for 
denoising [13], 

xks — argmini4r(w) s.t- Ww — y\\^ < t, (4) 

w 

^In contrast to our analytical and numerical results, the algorithm presented in Section V is not dependent on a particular choice for the 
matrix 
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where K{x) denotes the Kolmogorov complexity of x, defined as the length of the shortest input to a Turing 
machine [28] that generates the output x and then halts, and r = Na'^ controls for the presence of noise. It can 
be shown that K{x) asymptotically captures the statistics of the stationary ergodic source X, and the per-symbol 
complexity achieves the entropy rate H = H{X), i.e., limjv^oo j^K{x) = H almost surely. Noting that universal 
lossless compression algorithms [11, 12] achieve the entropy rate for any discrete-valued finite state machine source 
X, we see that these algorithms achieve the per-symbol Kolmogorov complexity almost surely. 

Donoho et al. expanded the KS to the linear CS measurement setting y — $x but did not consider measurement 
noise [29]. A recent paper by Jalali and Maleki [18], which appeared simultaneously with our work [1, 19], provides 
an analysis of a modified KS suitable for measurements corrupted by noise of bounded magnitude. Inspired by [29], 
we estimate x from noisy measurements y using the empirical entropy as a proxy for the Kolmogorov complexity. 

Separate notions of complexity-based regularization have also been shown to be well suited for denoising and CS 
recovery [7,30-32]. For example, minimum description length (MDL) [20-22,32] provides a universal framework 
composed of classes of parametric signal models for which the signal complexity can be defined sharply. In 
principle, complexity-based regularization approaches can yield MDL-flavored CS recovery algorithms that are 
universal for parametric classes of sources [7,30,31]. An alternative universal denoising approach computes the 
universal conditional expectation of the signal [19]; we leave this for future work. 

III. MAP ESTIMATION AND DISCRETIZATION 

In this section, we assume for exposition purposes that we know the input statistics fx- Given the measurements 
y, the MAP estimator for x has the form 

a;MAP = argmax/x(w)/Y[x(y|w). (5) 

W 

Because z is i.i.d. Gaussian with mean zero and known variance f7|, 

/y|x(2/k)=cie-^^ll^-*-ll', (6) 

where ci = (27rtTD^^^/^ and C2 = are constants and || • || denotes the Euchdean norm. Plugging into (5) and 
taking log likehhoods, we obtain 

xmap = argmin*-^(u;), 

w 

where ^''^(•) denotes the objective function (risk) 

^-^H 4 -ln(/x(w)) + C2||j/-$w|p; (7) 

our ideal risk would be '^-^[xmap)- 

Instead of performing continuous-valued MAP estimation, we optimize for the MAP in the discretized domain 
72.^. We begin with two technical conditions on the input. 
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Condition 1: We require that the probability density fx has bounded support, i.e., there exists A = [a^min, a^max] 
such that fx{x) = for x ^ A^. 

Condition 2: We require that the probability density has bounded derivatives 



<P (8) 



for X € A'^, where is the derivative with respect to entry ?i of .x, rt G {1, . . . , N}, and p > is a constant. 
A limitation of the data-independent reproduction levels (3) is that TZ has infinite cardinality. Thanks to Condition 1, 
for each value of 7 there exists a constant C3 > such that a finite set of reproduction levels 

7e.^(-^,-^;^,...Xl (9) 



7 7 7 

will quantize the range of values A to the desired accuracy. This finite quantization step reduces the complexity 
of the estimation problem from infinite to combinatorial. In fact, if we fix C3 = 1, then the range TZp covers all 
symbols Xi with overwhelming probability for sufficiently large N . The resulting reduction in complexity is due to 
the structure in 7?.^ and independent of the particular statistics of the source X. 

Let xmap be the quantization bin in {TZp)^ nearest to xmap- Condition 2 ensures that a small perturbation 
from Xmap to xmap does not change \n{fx{-)) by much. We use this fact to prove that 'i'^ (xmap) is sufficiently 
close to (xmjip) asymptotically. 

Theorem 1: Let $ G R^^^^ be an i.i.d. Gaussian measurement matrix where each entry has mean zero and 
variance jj. Suppose that Conditions 1 and 2 hold, the aspect ratio 5 > in (2), and the noise z e M^^ is i.i.d. 
zero-mean Gaussian with finite variance <j^. Then for all e > 0, the quantized estimator xhjap satisfies 

'^^{xMAp) < "^^(xmap) < "S^ixMAp) + Ne 

almost surely as iV — > 00. 

Theorem 1 is proved in Appendix A; it shows that in terms of the MAP objective function, xmap is near-optimal 
almost surely asymptotically. Thus, it is natural to perform the MAP optimization directly in the quantized domain: 

a^MAP (T^-F )- arg min (10) 

From Theorem 1, we have 

-^S^^ixMAp) < ^^(xMApiTlF)) < ^^(XMAP) < ^^ixMAp)+Ne (11) 

almost surely asymptotically for any e > 0. 

Discrete probability space: Now that we have set up a quantization grid (TZp)^ for x, we convert the distribution 
fx to a probability mass function (PMF) px over {TZp)^ ■ Let 
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and define the PMF px{') as 



We now have 



/ N A fxjw) 

px[w) = — . (12) 



min (~ln{px{w)) + C2\\y - ^w\\'^) 

= *^^(xMAp(7^F))+ln(/7^^). (13) 

The additive constant \n{f-jzp) can be ignored during the MAP optimization over {TZp)'^ , so that (10) gives the 
MAP estimate of x over (TZf)^ due to (13). 

IV. Universal MAP estimation 

In [13], Donoho showed that for the scalar channel y ~ x + z: (i) the Kolmogorov sampler xks (4) is drawn from 
the posterior distribution Px\Y{x\y)', and (ii) the mean square error (MSB) of this estimate Ex.z,<!'[\\y — xksW'^] 
is equal to twice the minimum mean squared error (MMSE). 

Given that Theorem 1 shows that the risk penalty due to quantization vanishes asymptotically in N, we now 
describe a universal estimator for CS over a quantized grid. Consider a universal prior pu [11,12] that might 
involve Kolmogorov complexity [14-16], e.g., pu{w) = 2~^^'"'\ or MDL complexity with respect to some class of 
parametric sources [20-22]. The universal prior has the fortuitous property that for every stationary ergodic source 
X and fixed e > 0, there exists some minimal Nq{X, e) such that 

- \n{pu{w)) < - \n{px{w)) + eN (14) 

for all w ^ {TZp)^ and N > Nq{X, e) [11, 12]. We optimize over an objective function (risk) that incorporates pu 
and the presence of additive white Gaussian noise in the measurements:^ 

^ - Hpuiw)) +C2\\y- ^w\\^, (15) 

resulting in 

x^jAp=SiTg min "il^iw). (16) 

Based on the results by Donoho for the scalar channel [13], we now present a conjecture on the quality of the 
reconstruction x^j^p; experimental evidence to assess this claim is presented in Section VII. 

Conjecture 1: Assume that the conditions of Theorem 1 hold. Then for all e > 0, the mean squared error of the 
universal MAP estimator x^jj^p satisfies 

Ex,z^<i> [\\x-xIjap\\^] <2Ex,z^^ [\\x- Ex[x\y,m^] + Ne 

for sufficiently large N. 

'The formulation in (16) corresponds to a Lagrangian relaxation of the approach studied in [18]. 
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We note in passing that when the SNR is low, the MMSE could be almost as large as the energy of the signal, 
\\x\\2- In these problems, achieving twice the MMSE means achieving a mean squared error that could be greater 
than ||a;|||, which is not a useful guarantee. In contrast, the mean square error achieved by xmmse will be smaller 
than ||a:||2, which is a much better signal estimate than xmap- This excellent performance at low SNR is particularly 
useful in finance (cf. [33] and references therein). 

V. Algorithmic approach 

Although the results of the previous section are theoretically appealing, a brute force optimization of x^jj^p 
is computationally intractable. Instead, we propose an algorithmic approach based on Markov chain Monte Carlo 
(MCMC) methods [34]. Our approach is reminiscent of the framework by Weissman et al. and Yang et al. for lossy 
data compression [27,35,36]. 

A. Universal compressor 

We propose a universal lossless compression formulation following the conventions of Weissman et al. [27,35]. 
Our goal is to characterize — \og{pi;{w)), cf. (15). To do so, we use empirical entropy, which for stationary ergodic 
sources converges to the per-symbol entropy rate almost surely [23]. 

To define the empirical entropy, we first define the empirical symbol counts: 

ng{w,am ^ \{te[q + l,N]:wlzl=a,w' =p}l (17) 

where q is the context depth [12,37], /? e TZp, en E {TZfY, and wl is the string comprising symbols i through j 
within w. We now define the order q conditional empirical probability for the context a as 

Pq{w,a)[j3]^ ^ -( rr^y (18) 

and the order q conditional empirical entropy, 

Hq{w)^-^ E n,Ka)[/?]log2(p,(u;,«)[^]), (19) 

where the sum is only over nonzero counts and probabilities. 

Allowing the context depth q = qN ^ o(log(A^)) to grow slowly with N, various universal compression 
algorithms can achieve the empirical entropy Hq{-) asymptotically [11, 12,37]. On the other hand, no compressor 
can outperform the entropy rate. Additionally, for large N, the empirical symbol counts with context depth q provide 
a sufficiently precise characterization of the source statistics. Therefore, Hq provides a concise approximation to 
the per-symbol coding length of a universal compressor. 
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B. Markov chain Monte Carlo 

Having approximated the coding length, we now describe how to optimize our objective function. We define the 
energy '^^•'{w) in an analogous manner to '^'^{w), using Hq{w) as our universal coding length (15): 

-^"•^{w) = NHq{w)+c4y-^wf, (20) 

where C4 = C2 log2(e). The minimization of this energy is analogous to minimizing [w). The Boltzmann PMF 
is then defined as 

p^H ^ -J-exp(-s*^'H), (21) 

where s > is inversely related to temperature in simulated annealing and C,s is a normaUzation constant. 
Ideally, our goal is to compute the globally minimum energy solution 

^f/AP = arg rmn J>"^{w). (22) 

We use a stochastic MCMC relaxation [34] to achieve the globally minimum solution in the limit of infinite 
computation. In MCMC, the space w E (TZf)^ is analogous to a statistical mechanical system, and at low 
temperatures the system tends toward low energies. 

MCMC samples from the Boltzmann PMF (21) using a Gibbs sampler, in each iteration, a single element Wn 
is generated while the rest of w, lu^" = {wi : n ^ i}, remains unchanged. We denote by u)"~^/3it;^]^ the 
concatenation of the initial portion of the output vector w""^, the symbol /3 e TZp, and the latter portion of the 
output w^_^_l. The Gibbs sampler updates Wn by resampling from the PMF: 

( I \n^ CXp(-gVi/g^K-^a<+i)) 

Psiwn ~ a\w^ ) = ^ ^7 r (23) 

1 



Hbenp [N/\Hq{w, n, 6, a) + C4A(i(w, 71, 6, a)] 



where 



is the change in empirical entropy Hq{w) (19) when Wn = a is replaced by 6, and 

M{w,nAa) ^ \\y-<f{wr'bw:!+,)r - \\y - ^w^'aw^^+iW (24) 

is the change in \\y — $it;|p when Wn ~ a is replaced by b. The maximum change in the energy within an iteration 
of Algorithm 1 is then bounded by 

Ao = max max max \NAHa(w,n,b,a)+CAAd(w,n,b,a)\. (25) 

^ l<n<N we(nF)'^ a,b£nF yv / \ /I 

Note that X is assumed bounded (cf. Condition 1) so that (24-25) are bounded as well. During the execution of 
the algorithm, we set a sequence of decreasing temperatures that takes into account the maximum change given in 
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(25): 

St ^ \n{t)/{cNAg) for some c> 1. (26) 

At low temperatures, i.e., large st, a small difference in energy 'i'^''{w) drives a big difference in probability. 
Therefore, we begin at a high temperature where the Gibbs sampler can freely move around (TZf)'^- As the 
temperature is reduced, the PMF becomes more sensitive to changes in energy (21), and the trend toward w 
with lower energy grows stronger In each iteration, the Gibbs sampler modifies Wn in a random manner that 
resembles heat bath concepts in statistical physics. Although MCMC could sink into a local minimum, we decrease 
the temperature slowly enough that the randomness of Gibbs sampling eventually drives MCMC out of the local 
minimum toward the globally optimal x^^^p. 

Pseudocode for our MCMC approach appears in Algorithm 1. We refer to the processing of a single location 
as an iteration and group the processing of the N different entries of w, randomly permuted, into super-iterations. 
During the minimization process, we refer to the approximation as w. 

The following theorem is proven in Appendix B, following the framework espoused in [35]. 

Theorem 2: Let X be a stationary ergodic source that obeys Condition 1. Then the outcome w"^ of Algorithm 1 
after r iterations obeys 



lim ^""(w'') = min ^""(v) = ^"^ ( x"\p\ ■ 



Theorem 2 shows that Algorithm 1 matches the best-possible performance of the universal MAP estimator, which 
we believe to be close to the performance of the MMSE estimator (Conjecture 1 in Section IV). To gain some 
insight about the convergence process of MCMC, suppose that at iteration t the energy of the algorithm output 
vj/^g (u>) has converged to a steady state (see Appendix B for details on convergence). We can then focus on the 
probabilistic ratio 



i.e., the ratio by which Pst{w) is smaller than Psti^MAp) ^'^^ arbitrary w <E (TZp)^ ■ When we square the number 
of super-iterations from t to t^, the inverse temperature is doubled from st to 2st, and the corresponding ratio at 
time is 

P2st{w) ^ cxp(~2st^^^(w)) ^ / cxp(~St^^-iH) \ ^ ^ / p,^ jw) \ ^ 
P2sAxmap) exp(-2st«'^5(a;^/4p)) Vexp(-st1'^<J (xf/^p)) / XPsA^fiAp) / 
That is, between super-iterations t and P the probability ratio between a higher energy w E (TZp)^ to some 
minimal energy x^j^p is also squared. 

The practical value of our proposed MCMC approach may be reduced due to its high computational cost, 
dictated by the number of iterations r required for convergence to the universal MAP estimator Nonetheless, this 
approach provides a starting point toward further performance gains of more practical algorithms for computing the 
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Algorithm 1 MCMC for Universal CS 
1: Inputs: Initial point x* G K", reproduction alphabet TZp, noise variance cr^, number of super-iterations r, 

temperature constant c > 1 

2: Outputs: Approximation w of x^jj^p 

3: Initialize w by quantizing x* to (TZf)^ 

4: Compute nq(w, a)[/3], V a G (7?.f)'', P e TZf 

5: for t = 1 to r do // super- iteration 



6: ln{t)/{cNAq) II St, cf. (26) 

1: Draw permutation {1, . . . , N} at random 

8: for i' = 1 to iV do // iteration 
9: Let n be component t' in permutation 

10: for all /3 in 7^i? do // possible new Wn 

11: Compute AiJg(i(j, n, /3, u>„) 

12: Compute Ac?(w, n, /3, Wn) 

13: Compute Ps{Wn = 

14: end for 

15: Generate w„ using ps(-|w\") // G;M5 

16: Update nq{w,a)[l3], V a G (TeF)^ P (^Tlp 

17: end for 



18: end for 

19: return w 



universal MAP estimator; furthermore, our experiments in Section Vll will show that the performance of MCMC 
is comparable to and sometimes significantly better than existing state-of-the-art algorithms. 

C. Computational challenges 

Studying the pseudocode of Algorithm 1, we recognize that Lines 11-13 must be implemented efficiently, as 
they run rN\R.F\ times. Lines 11 and 12 are especially challenging. 

For Line 11, a naive update of AHq{w,n,b,a) has complexity 0{\TZf\'^^^), cf. (19). To address this problem, 
Jalali and Weissman [35] recompute the empirical conditional entropy in 0{q\TZF\) time only for the 0{q) contexts 
whose corresponding counts are modified [35]. The same approach can be used in Line 16, again reducing 
computation from 0(|7^F|'^+l) to 0(g|7^F|)■ 
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We now focus on computation of Ad{'w, n, b, w„) in Line 12. Define v ^ y — <I>w. From (24) we get 

M 



m—1 
M 



= [2u,„$,„„(w„ -b) + ($,„„(w„ - fe))^] 



m—1 



where <!>„ is column n of $. By pre-computing the inner product {v, <&„) and squared £2 norm ||$„|p. Line 12 can 
be implemented in constant time. Seeing that the inner product and squared £2 norm require 0{M) time, which 
is aggregated over \TZf\ calls per iteration to Line 12, Ad{w,Ti,b, a) requires 0{Nr{M + \TZf\)) time in total. 
Combined with the computation for Line 11, and utilizing that M 3> qITZfI'^ in practice, the entire runtime of our 
algorithm is 0{rMN). 



While Algorithm 1 is a first step toward universal CS, it suffers from a large number of reproduction levels \TIf\- 
In order to meet a target performance level, N must be large enough to ensure that TZf quantizes a broad enough 
range of values of R finely enough to represent the estimate x well. For finite N, estimation performance using 
the reproduction levels (9) could suffer from high computational complexity. 

To estimate better with finite N, we utilize reproduction levels that are adaptive instead of the fixed levels in 
TZf- To do so, instead of w G (TZf)^ , we optimize over u e Z^, where \Z\ < \TZf\- The new alphabet Z does 
not directly correspond to real numbers. Instead, there is an adaptive mapping A : Z ^ M.. Considering the energy 
function (20), we now compute the empirical symbol counts nq{u, a)[fi], order q conditional empirical probabilities 
Pq{u,a)[f3], and order q conditional empirical entropy Hq{u) using u G Z^ , a G Z'', and P G Z, cf. (17), (18), 
and (19). Similarly, we use ||y — $^(it)|p instead of ||y — $wjp, where A{u) is the straightforward vector extension 
of A. These modifications yield an adaptive energy function 



where [$^(u)],„ denotes the m*'* entry of the vector ^A{u). The optimal mapping depends entirely on y, $, and 
u. From a coding perspective, describing Aoptiu) requires Hq{u) bits for u and |Z|61oglog(iV) bits for Aopt to 
match the resolution of the nonadaptive alphabet TZp, with 6 > 1 an arbitrary constant [27]. The resulting coding 
length defines our universal prior. 



VI. Adaptive reproduction levels 



^^-{u) ^ NHq{u) + c4y - '^A{u)\ 



We choose Aopt to optimize for squared £2 error. 
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A. Optimization of reproduction levels 

We now describe the optimization procedure for Aopt, which must be computationally efficient. Write 



N 



T{A) ^ \\y - ^A{V)\\1 = J2 Vrn - E '^^nAM 



m— 1 \ n—1 



For T(^) to be minimal, we need zero-valued derivatives 
dTiA) ' 

■ = —I 

m— 1 \ n—1 

0, V /? e z. 



dA{^) 



/3} 



Define the location sets 

Lji^ {n:\<n<N,Un = 
for each (3 ^ Z, and rewrite the derivatives of T(^), 
dT{A) 



dAiP) 

Let the per-character averaged column values be 

Mm/3 — E '^mn, 

for each m £ {1, . . . , Af } and /3 G Z. We desire the derivatives to be zero, cf. (27): 

M / \ 

m=l \ AeZ / 

Thus, we must satisfy the system of equations, 

M / \ 

P-ynIi 

for each (3 G Z. We can write the right hand side of each of these equations as 



M M / \ 

m=l m=l \\eZ / 



m=l \\ez ) 



Pmf3 



M 



AeZ m=l 

for each (3 £ Z. The system of equations can be described in matrix form as 

n e 



EM sr^Al 



A/ 



A/ 



Z^m=l Dml^mlix 



(27) 



(28) 



(29) 
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Note that by writing /i as a matrix with entries indexed by row m and column /3 given by (28), we can write fl as 
a Gram matrix, ft ~ A'^A*' ™d we also have Q = fr^y. The optimal A can be computed as a |Z| x 1 vector 



if the \Z\ X \Z\ matrix is invertible. We note in passing that numerical stability is improved by regularizing fl. 
Note also that 



which can be computed in 0(M|Z|) time instead of 0{MN). 

B. Computational complexity 

Pseudocode for the adaptive reproduction level estimation appears as Algorithm 2. We discuss computational 
requirements for each line of the pseudocode that is run within the inner loop. 

• In Line 12, the differences in empirical conditional entropy can be computed in 0((7|2^|) time as demonstrated 
by Jalali and Weissman [35]. 

• In Line 13, we update for m € {1, . . . , M} in 0{M) time. 

• Line 14 updates Q,. Because we only need to update 0(1) columns and 0(1) rows, each such column and 
row contains 0(|Z|) entries, and each entry is a sum over 0{M) terms, we need 0(A/|Z|) time. 

• Line 15 requires to invert Q, in 0(|Zp) time. 

• Line 16 requires 0{M\Z\) time, cf. (30). 
« Line 17 requires 0(|Z|) time. 

In practice we typically have M ^ |Zp, and so the aggregate complexity is 0{rMN\Z\), which is greater than 
the computational complexity of the fixed reproduction level Algorithm 1 by a factor of 0(|Z|). 



We implemented universal MCMC estimation (Algorithm 2) in Matlab and tested it using several stationary 
ergodic sources. Although not all of our sources obey Condition 2, they nonetheless illustrate the performance of the 
algorithm. Our code is available for download at http://people.engr.ncsu.edu/dzbaron/software/UCS_BaronDuarte/. 
We chose an alphabet of size \Z\ = 7 for all sources tested. For each source, inputs x of length TV = 10000 
were generated. Each such x was multiplied by a random Gaussian matrix $. We then added measurement 
noise z whose variance was selected to ensure that the signal to noise ratio (SNR) was 2, 6, 10, or 14 dB. 
We compared the performance of universal MCMC estimation starting from an initial estimate x* ~ <l>^y to that of 
£i-norm minimization recovery and the expectation-maximization Bernoulli-Gaussian approximate message passing 
algorithm (EMBG) of [24]; for each value of M and SNR, we plot the median signal-to-distortion ratio of each 




(30) 



VII. Numerical results 
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Algorithm 2 MCMC with Adaptive Levels 
1: Inputs: Initial point x* E W, adaptive alphabet Z, noise variance <t^, number of super-iterations r, temperature 

constant c > 1 

2: Outputs: Approximation A{u) of x^jj^p 

3: Initialize u and A from x* and Z 

4: Compute ng{u, a)[/3], V a € Z', ^ € Z 

5: Initialize 

6: for f = 1 to r do // super- iteration 



7: s ^ ln(i)/(ciVAg) // s St, cf. (26) 

8: Draw permutation {1, . . . , N} at random 

9: for = 1 to TV do // iteration 

10: Let n be component t' in permutation 

11: for all /3 in Z do // possible new u„ 

12: Compute AHq{u,n, 13, Un) 

13: Compute /im^,V TO g {1, Af} 

14: Update // 0(1) row5 one/ columns 

15: Compute ylopt // invert Vl 

16: Compute |ly - $yt(u""^/3M^_^i)||2 

17: Compute Ps(u„ = /3|u\") 

18: end for 

19: Un M,i // save previous value 

20: Generate u„ using II Gibbs 

21: Update riq (•)[•] at 0{q) relevant locations 

22: Update y m, e {Un,Un} 

23: Update ^l II 0{1) rows and columns 

24: end for 



25: end for 

26: return A{u) 



algorithm over 25 draws of x, $, and z; the median yields somewhat smoother plots than the mean signal-to- 
distortion ratio, which has the same flavor Although MCMC was slower than the ^i-norm minimizer [38] and the 
EMBG algorithm used in our simulations, its runtime of several minutes was nonetheless reasonable. 

Results for the switching source (Fig. 1) were highlighted in Section 1, where it is seen that MCMC performs 
well while i!i-norm minimization and EMBG fail due to the input not being canonically sparse. 
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Fig. 2. Universal MCMC, EMBG, and li -norm minimization recovery results for a two-state Markov source with nonzero entries 
of value +1 as a function of the number of Gaussian random measurements M for different SNR values. MCMC outperforms both 
li-norm minimization and EMBG due to the two quantization levels required to encode the observed distribution. 

Next, we examine three sources for sparse signals whose sparse supports (the locations of the nonzero entries) 
are generated by a two-state Markov state machine (nonzero and zero states). The transition from zero to nonzero 
state for adjacent entries has probability g|g, while the transition from nonzero to zero state for adjacent entries 
has probability 10%; these parameters yield 3% sparsity on average. The three sources considered differ in the 
distribution of the nonzero values, and £i-norm minimization offers reasonable recovery performance. 

For our first two-state Markov source, the nonzeros are set to a constant value +1; such signal structure has 
low entropy. Figure 2 shows that MCMC consistently outperforms ^i-norm minimization as expected; MCMC 
also outperforms EMBG due to only two quantization levels being necessary to accurately model the observed 
distribution. 

For our second two-state Markov source, the nonzeros are drawn from a Rademacher distribution, i.e., uniform 
over { — 1, +1} (Fig. 3). Here, universal MCMC estimation performs well for low to moderate SNR, outperforming 
both ^i-norm minimization and EMBG for sufficiently large values of AI. Surprisingly, MCMC fails for high SNR. 
To explain this behavior, we note that for high SNR the constant C2 in (6) is large, and the Gibbs sampler (23) 
is strongly motivated to minimize the quadratic term ||y — $.4(it)|p while accommodating larger values for the 
empirical entropies Hq, which in turn allows for more complex sources to be used in the estimate. Such behavior 
within the universal MCMC algorithm will push its estimates away from the low-complexity priors that we seek 
to promote, and may also appear for other sources once the SNR is sufficiently high. We plan to further study this 
behavior in light of the parameters of the Gibbs sampler, including the choice of initial point x* . 

For our third two-state Markov source featuring nonzero values following a uniform distribution [/[0, 1] (Fig. 4), 
MCMC performs poorly. The problem we saw during execution is that the adaptive alphabet A spends many of the 
representation levels in Z on zero-valued entries of the signal, and only one level for the nonzeros, which leads to 
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Fig. 3. Universal MCMC, EMBG, and £i -nonn minimization recovery results for a two-state Markov source with nonzero entries 
drawn from a Rademacher (±1) distribution as a function of the number of Gaussian random measurements M for different SNR 
values. MCMC outperforms £i-norm minimization and EMBG for most cases shown, with the surprising exception of SNR = 
UdB. 

poor quantization. We noticed that the quantization step for the initial estimate x* in Step 3 of Algorithm 2 pushes 
us away from convergence. We conjecture that a key hurdle is the calculation of an initial quantizer that is suitable 
for a variety of sources. 

Our last source generates the entries of the signal by drawing them independently from a Bernoulli distribution, 
where each Xn was +1 with probability 3% and zero otherwise. For this source, MCMC outperforms £i-norm 
minimization and EMBG, except when the number of measurements M is low. We also compared the performance of 
the two algorithms to the MMSE achievable in the Bayesian regime with known statistics [39]; similar computations 
were performed by Guo and Wang [26]. Interestingly, the squared error achieved by MCMC is thrice the MMSE. 
We are left to wonder whether one could approach the mean squared error bound given in Conjecture 1 in the limit 
of infinitely many super-iterations. 



VIII. Conclusions 

This paper provides an initial step towards the formulation of computationally feasible universal algorithms 
for signal estimation from linear measurements. Here, universality denotes the property that the algorithm need 
not be informed of the probability distribution for the recorded signal prior to acquisition; rather, the algorithm 
simultaneously builds estimates both of the observed signal and its distribution. Inspired by the Kolmogorov 
sampler [13] and motivated by the need for a computationally tractable framework, our contribution focuses on 
stationary ergodic signal sources and relies on a maximum a posteriori (MAP) estimation algorithm, for which 
it is possible to establish asymptotic near-optimality with high probability. The algorithm is then implemented 
via a Markov Chain Monte-Carlo (MCMC) formulation that is proven to be convergent in the limit of infinite 
computation. While the practical impact of an MCMC-based approach is mitigated due to its high computational 



17 



m25 
o 

2 20 



tr 
o 

m 15 



MCMC 




EMBG 




ell-1 min. 




SNR=14dB 


• 


SNR=10clB 


▼ 


SNR=6dB 





SNR=2dB 




1000 2000 3000 4000 5000 6000 
Number of measurements M 



7000 



Fig. 4. Universal MCMC, EMBG, and li -norm minimization recovery results for a two-state Markov source with nonzero entries 
drawn from a uniform distribution U[0, 1] as a function of the number of Gaussian random measurements M for different SNR 
values. The performance of universal MCMC estimation is hampered likely due to the continuous-valued nature of the source 
studied. 



cost, our experiments have shown that its performance is comparable to and in some cases significantly better than 
existing state-of-the-art algorithms, with a significant advantage observed for sources that are non-sparse. 

Our expectation is that these initial results will spur additional work to improve the computational cost of imple- 
menting universal MAP estimation from linear measurements, including techniques that accelerate the convergence 
of MCMC. An alternative approach to implement universal MAP estimation could be obtained by moving from 
MCMC to other optimization algorithm such as belief propagation. 



Appendix A. Proof of Theorem 1 
The lower bound "if^ (xmap) < (xmap) is trivial, because xmap is the MAP solution. For the upper bound, 

\\y 

< \\y 

< \\y 

< \\y 

< \\y 

where the i^o norm in (31) isolates the largest absolute difference between Xmap and Xmap, which is 0(log(A^)^^) 
(3); and for the distribution we assumed for the matrix $ we obtain ||$|| = 1 + (5^°-^ = 0(1) almost surely 
asymptotically [40], where S is the aspect ratio (2). Observe that ||y — <I>.Tj\//ip|| = 0{^/N), because the per- 
element signal to noise ratio of the measurement process (1) is a function of az and therefore finite. Substituting 



^xmapW 

^XmApW + \\^{XMAP - XMAP)\\ 

^xmapW + ||^>||||a;A/AP - xmap\ 



'^'xmapW + ll^-ll VA^Il-TM^p - xmapWoo (31) 
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Fig. 5. Universal MCMC, EMBG, and £i -nonn minimization recovery results for a source witli i.i.d. Bernoulli entries with nonzero 
probability of 3% as a function of the number of Gaussian random measurements M for different SNR values. MCMC outperforms 
£i -norm minimization and EMBG for a sufficiently large number of measurements M. 



this observation into (32), 



< 



\y - <^XMAP\? 

{^\y-^XMAp\\ +0 
\y~<^XMAp\? + 0\ 



log(iV) 
N 



\og{N) 



\\y-^XMAp\\^ + o{N) 



(33) 



almost surely asymptotically. That is, discretization does not change the noise hypothesized by the MAP estimator 
by much. 

We now show that ln(/x (Jmap)) is similar to h\{fx[xMAp))- Owing to our reproduction levels (3), 



\xmap - XMAph < O 



N 



Because the log derivatives of fx are bounded, cf. Condition 2, 



oiN). 



I MfxixMAp)) - ^T^{fx{xMAp))\ 
< P\\XMAP - ^M^pIIi = 0{N). 



(34) 



Combining (7), (33), and (34), we see that (xmap) — (xmap) ~ o(A^) almost surely asymptotically. □ 



Appendix B. Proof of Theorem 2 

Our proof mimics a very similar proof presented in [35] for lossy source coding; we include all details for 
completeness. The proof technique relies on mathematical properties of non-homogeneous (e.g., time-varying) 
Markov Chains (MCs) [41]. Through the proof, S = {TZp)^ denotes the state space of the MC of codewords 
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generated by Algorithm 1, with size \S\ = \TZf\^- We define a stochastic transition matrix from S to itself 
given by the Boltzmann distribution for super-iteration t in Algorithm 1. Similarly, 7r(() defines the stable-state 
distribution on S for satisfying Trjj-jPfj) = ttjj). 

Definition 1: [41] Dobrusliin's ergodic coefficient of a MC transition matrix P is denoted by 5{P) and defined 

as 

5{P) ^ , max -pjlli, 
where Pi denotes the row of P, 1 < n < N. 

From the definition, < S{P) < 1. Moreover, the ergodic coefficient can be rewritten as 

N 

(5(P) = 1- mill y^min.{p^k,Pjk), (35) 

l<i,'j<N ^ ^ 

- fc=l 

where pij denotes the entry of P at the i*^ row and j*'' column. 

We group the product of transition matrices across super-iterations as P(ti->t2) = OtLti ^(t)- There are two 
common characterizations for the stable-state behavior of a non-homogeneous MC. 

Definition 2: [41] A non-homogeneous MC is called weakly ergodic if for any distributions ^ and ly over the 
state space S, and any ti G N, 

limsupj^^^||/^P(4^^t2) - i^P(^ti^t2)\\i = 0- 

Similarly, a non-homogeneous MC is called strongly ergodic if there exists a distribution tt over the state space S 
such that for any distribution p, over S, and any ti e N, 

limsup(^„,^||^P(t^^f^) - 7r|li = 0. 

We will use the following two theorems from [41] in our proof. 

Theorem 3: [41] A MC is weakly ergodic if and only if there exists a sequence of integers < < ^2 < ■ • ■ 
such that 

oo 

^(l-<5(P(,,^,.,,)))=oo. 

i=l 

Theorem 4: [41] Let a MC be weakly ergodic. Assume that there exists a sequence of probability distributions 

{T^{t)}iZi on the state space S such that tti^.-^P^^i-^ = Hfj-y Then the MC is strongly ergodic if 



oo 

E 

t=l 



lk(t) - 7r(t+l)l|l < OO- 



The rest of proof is structured as follows. First, we show that the sequence of stable-state distributions for the 
MC used by Algorithm 1 converges to a uniform distribution over the set of sequences that minimize the energy 
function as the iteration count t increases. Then, we show using Theorems 3 and 4 that the non-homogeneous 
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MC used in Algorithm 1 is strongly ergodic, which by the definition of strong ergodicity implies that Algorithm 1 
always converges to the stable distribution found above. This imphes that the outcome of Algorithm 1 converges 
to a minimum-energy solution as t — > oo, completing the proof of Theorem 2. 

We therefore begin by finding the stable-state distribution for the non-homogeneous MC used by Algorithm 1. 
At each super-iteration t, the distribution defined as 

( ^ A exp(-5t^^^(u.)) ^ 1 

^'^'^"^^ J:.^S^M-St^"<^)) E.e5Cxp(-St(vI/«^(z)-vI/ff.H))- ^^"^ 
satisfies 7r(j')P((-) = 7r((), cf. (23). We can show that the distribution Tr^^) converges to a uniform distribution over 
the set of sequences that minimize the energy function, i.e., 

hm TTmiw) = < (37) 

where "H = {w G S s.t. 'i'^''{w) = min^gs (z)}. To show (37), we will show that 7r(t)(u') is increasing for 
w E H and eventually decreasing for w E . Since for w gH and z G S we have '^^•'{z) — 'i'^''{w) > 0, for 
<i < t2 we have 

zeS zes 
which together with (36) implies 7r(-tj)(w) < 7r(j,)(w). On the other hand, if w G , then 

. ^ 1 

E.:*^,(.)>*«,Mexp(-.st(1'^'(z) - *^'H)) +E.:*«,(^)<*«,Mexp(-.st(4'«'(z) - ^"^{w)))' 

(38) 

For sufficiently large St, the denominator of (38) is dominated by the second term, which increases when st 
increases, and therefore ttj-^) (w) decreases for w £ TiP as t increases. Finally, since all sequences w E H have the 
same energy '^^''{w), it follows that the distribution is uniform over the symbols in H. 

Having shown convergence of the non-homogenous MC's stable-state distributions, we now show that the non- 
homogeneous MC is strongly ergodic. The transition matrix of the MC at iteration t depends on the temperature 
St in (26) used within Algorithm 1. We first show that the MC used in Algorithm 1 is weakly ergodic via Theorem 3; 
the proof of the following Lemma is given in Appendix C. 

Lemma 1: The ergodic coefficient of P{t) for any t > is upper bounded by 

5 {Pit)) < 1 - exp(-stiVA,), 

where Ag is defined in (25). 

We note in passing that Condition 1 ensures that Ag is finite. Using Lemma 1 and (26), we can evaluate the sum 
given in Theorem 3 as 

rv^ rvy OO 



i=i i=i i=i 



21 



and therefore the non-homogeneous MC defined by {P(^t}}t^i is weakly ergodic. Now we can use Theorem 4 to 
show that the MC is strongly ergodic by proving that 

oo 

^W'^it) - '^{t+i)\\i < oo. (39) 
t=i 

Since we know from earlier in the proof that 7r(() (w) is increasing for w G H and eventually decreasing for w E H'~' , 

there exists a to G such that for any ti > to, 

ti ti ti 

t=to went=to w<^Ht=to 

= X ('^(ti+i)(^) -^(to)H) + X (^(«o)(«^) -^(ti+i)M) 

= lk(ti+i) - 7r(4„)|li < |l7r(tj+i)l|i + ||7r(,,„)||i = 2. 

Since the right hand side does not depend on t\, we have that X^t^i IKlt) ~ ""(t+i)!!! < This implies that the 
non-homogeneous MC used by Algorithm 1 is strongly ergodic, and thus completes the proof of Theorem 2. 

Appendix C. Proof of Lemma 1 

Let yi , 2/2 be two arbitrary sequences in S. The probability of transitioning from a given state to a neighboring 
state in an iteration within iteration t' of super-iteration t of Algorithm 1 is given by (23), and can be rewritten as 

exp (-Sf^'-^9(w*'~^awj^, J 

E,en. cxp (*^.K'-'K^+i) - 
exp(-stAq) 



> 



where 'I'^?-, (/(w^i w^+i) ~ miuagT^^ 4'^'' (w* ~^az/;^_|_j^). Therefore, the smallest probability of transition from 
yi to 2/2 within super-iteration t of Algorithm 1 is bounded by 

D ^ I N ^ A exp(-stA,) exp(-st7VAq) exp(-SiiVA5) 

y.zfn. ^(*)^^^"^^) ^ n = TTp^ = — ^ — ■ 

Using the alternative definition of the ergodic coefficient given in (35), 



6 (P(,)) = 1 - min V n^in(P(,)(z|2/i),P(,)(z|2/2)) < 1 - j^j^^EL^iXM = i _ exp(-s,7VA,), 
proving the lemma. 
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